Work stealing for GPU-accelerated parallel programs in a global address space framework
Abstract
Task parallelism is an attractive approach to automatically load balancing computation in a parallel system and adapting to the dynamism such systems exhibit. Exploiting task parallelism through work stealing has been extensively studied in shared- and distributed-memory contexts. In this paper, we study the design of a system that uses work stealing for dynamic load balancing of task-parallel programs executed on hybrid distributed-memory CPU-graphics processing unit (GPU) systems in a global address space framework. We take into account the unique nature of the accelerator model employed by GPUs, the significant performance difference between GPU and CPU execution as a function of problem size, and the distinct CPU and GPU memory domains. We consider various alternatives in designing a distributed work-stealing algorithm for CPU-GPU systems, weighing the impact of task distribution and data movement overheads. These strategies are evaluated using microbenchmarks that capture various execution configurations, as well as the state-of-the-art CCSD(T) application module from the computational chemistry domain. Copyright © 2015 John Wiley & Sons, Ltd.
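The central mechanism the abstract refers to is the work-stealing scheduler: each worker keeps a private deque of tasks, the owner pushes and pops work at one end, and idle workers steal from the other end of a random victim's deque. Below is a minimal shared-memory sketch of that discipline in C++. It is illustrative only: the names (StealingPool, Worker, submit) are assumptions, and it deliberately omits the distributed-memory, PGAS, and GPU data-movement aspects that the paper actually addresses.

// Minimal work-stealing sketch (not the paper's implementation).
// Owners pop newest work (LIFO, bottom of deque); thieves steal
// oldest work (FIFO, top of deque) from a random victim.
#include <atomic>
#include <cstdio>
#include <deque>
#include <functional>
#include <mutex>
#include <random>
#include <thread>
#include <vector>

using Task = std::function<void()>;

struct Worker {
    std::deque<Task> deque;  // this worker's private task deque
    std::mutex lock;         // simplification: a real deque would be lock-free
};

class StealingPool {
public:
    explicit StealingPool(unsigned n) : workers_(n), pending_(0) {}

    // Enqueue a task onto worker w's deque before the pool starts.
    void submit(unsigned w, Task t) {
        std::lock_guard<std::mutex> g(workers_[w].lock);
        workers_[w].deque.push_back(std::move(t));
        pending_.fetch_add(1);
    }

    // Run one thread per worker until every submitted task has executed.
    void run() {
        std::vector<std::thread> threads;
        for (unsigned id = 0; id < workers_.size(); ++id)
            threads.emplace_back([this, id] { work(id); });
        for (auto& t : threads) t.join();
    }

private:
    bool pop_local(unsigned id, Task& t) {
        auto& w = workers_[id];
        std::lock_guard<std::mutex> g(w.lock);
        if (w.deque.empty()) return false;
        t = std::move(w.deque.back());  // owner takes newest task (LIFO)
        w.deque.pop_back();
        return true;
    }

    bool steal(unsigned thief, Task& t) {
        static thread_local std::mt19937 rng(std::random_device{}());
        unsigned victim = rng() % workers_.size();  // random victim selection
        if (victim == thief) return false;
        auto& w = workers_[victim];
        std::lock_guard<std::mutex> g(w.lock);
        if (w.deque.empty()) return false;
        t = std::move(w.deque.front());  // thief takes oldest task (FIFO)
        w.deque.pop_front();
        return true;
    }

    void work(unsigned id) {
        Task t;
        while (pending_.load() > 0) {
            if (pop_local(id, t) || steal(id, t)) {
                t();
                pending_.fetch_sub(1);
            } else {
                std::this_thread::yield();  // no work found: retry stealing
            }
        }
    }

    std::vector<Worker> workers_;
    std::atomic<long> pending_;  // simple termination: count of unfinished tasks
};

int main() {
    StealingPool pool(4);
    // Deliberately skewed load: all tasks start on worker 0, so the other
    // three workers obtain work only by stealing.
    for (int i = 0; i < 16; ++i)
        pool.submit(0, [i] { std::printf("task %d\n", i); });
    pool.run();
}

Owners pop LIFO for cache locality while thieves steal FIFO, taking the oldest and typically largest remaining piece of work. On the hybrid CPU-GPU systems the paper targets, the same idea is complicated by the fact that a stolen task may need its data moved across the host-device or internode boundary, which is precisely the overhead the paper's design alternatives weigh.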
Related papers
Hierarchical Work Stealing on Manycore Clusters
Partitioned Global Address Space languages like UPC offer a convenient way of expressing large shared data structures, especially irregular structures that require asynchronous random access. But the static SPMD parallelism model of UPC does not support divide-and-conquer parallelism or other forms of dynamic parallelism. We introduce a dynamic tasking library for UPC that provides a simple...
Large-scale genome-wide association studies on a GPU cluster using a CUDA-accelerated PGAS programming model
Detecting epistasis, such as 2-SNP interactions, in Genome-Wide Association Studies (GWAS) is an important but time-consuming operation. Consequently, GPUs have already been used to accelerate these studies, reducing the runtime for moderately sized datasets to less than one hour. However, single-GPU approaches cannot perform large-scale GWAS in reasonable time. In this work we present multiEpi...
Resolutions of the Coulomb operator: VIII. Parallel implementation using the modern programming language X10
Use of the modern parallel programming language X10 for computing long-range Coulomb and exchange interactions is presented. By using X10, a partitioned global address space language with support for task parallelism and the explicit representation of data locality, the resolution of the Ewald operator can be parallelized in a straightforward manner including use of both intranode and internode...
A Cross-Input Adaptive Framework for GPU Programs Optimization
Recent years have seen a trend in using graphics processing units (GPU) as accelerators for general-purpose computing. The inexpensive, single-chip, massively parallel architecture of the GPU has evidently brought factors of speedup to many numerical applications. However, the development of a high-quality GPU application is challenging, owing to the large optimization space and complex unpredict...
Optimizing Partitioned Global Address Space Programs for Cluster Architectures
Optimizing Partitioned Global Address Space Programs for Cluster Architectures, by Wei-Yu Chen; Doctor of Philosophy in Computer Science, University of California, Berkeley; Professor Katherine A. Yelick, Chair. Unified Parallel C (UPC) is an example of a partitioned global address space language for high-performance parallel computing. This programming model enables applications to be written in a s...
Journal: Concurrency and Computation: Practice and Experience
Volume: 28, Issue: -
Pages: -
Publication year: 2016